Data visualisation skills (aggregation, interactive visualisations, and geospatial analysis) were used to find properties in the San Francisco market that are viable investment opportunities.
The main tasks is to visualise and analyse the real-estate data in Jupyter notebook, include:
Did any year experience a drop in the average sale price per square foot compared to the previous year?
If so, did the gross rent increase or decrease during that year?
How does the trend in rental income growth compare to the trend in sales prices? Does this same trend hold true for all the neighbourhoods across San Francisco?
What insights can you share with your company about the potential one-click, buy-and-rent strategy that they're pursuing? Do neighbourhoods exist that you would suggest for investment, and why?
# Import the required libraries and dependencies
import pandas as pd
import hvplot.pandas
from pathlib import Path
# Using the read_csv function and Path module, create a DataFrame
# by importing the sfo_neighbourhoods_census_data.csv file from the Resources folder
path = Path("./Resources/sfo_neighbourhoods_census_data.csv")
sfo_data_df = pd.read_csv(path).dropna()
# Review the first and last five rows of the DataFrame
sfo_data_df
| year | neighbourhood | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|
| 0 | 2010 | Alamo Square | 291.182945 | 372560 | 1239 |
| 1 | 2010 | Anza Vista | 267.932583 | 372560 | 1239 |
| 2 | 2010 | Bayview | 170.098665 | 372560 | 1239 |
| 3 | 2010 | Buena Vista Park | 347.394919 | 372560 | 1239 |
| 4 | 2010 | Central Richmond | 319.027623 | 372560 | 1239 |
| ... | ... | ... | ... | ... | ... |
| 392 | 2016 | Telegraph Hill | 903.049771 | 384242 | 4390 |
| 393 | 2016 | Twin Peaks | 970.085470 | 384242 | 4390 |
| 394 | 2016 | Van Ness/ Civic Center | 552.602567 | 384242 | 4390 |
| 395 | 2016 | Visitacion Valley | 328.319007 | 384242 | 4390 |
| 396 | 2016 | Westwood Park | 631.195426 | 384242 | 4390 |
392 rows × 5 columns
groupby function to group the data by year. Aggregate the results by the mean of the groups.¶# Create a numerical aggregation that groups the data by the year and then averages the results.
df_by_year = sfo_data_df.groupby('year').mean()
df_by_year
| sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|
| year | |||
| 2010 | 369.344353 | 372560.0 | 1239.0 |
| 2011 | 341.903429 | 374507.0 | 1530.0 |
| 2012 | 399.389968 | 376454.0 | 2324.0 |
| 2013 | 483.600304 | 378401.0 | 2971.0 |
| 2014 | 556.277273 | 380348.0 | 3528.0 |
| 2015 | 632.540352 | 382295.0 | 3739.0 |
| 2016 | 697.643709 | 384242.0 | 4390.0 |
housing_units_by_year = df_by_year['housing_units']
# Review the DataFrame
housing_units_by_year
year 2010 372560.0 2011 374507.0 2012 376454.0 2013 378401.0 2014 380348.0 2015 382295.0 2016 384242.0 Name: housing_units, dtype: float64
# Create a visual aggregation explore the housing units by year
housing_units_by_year.hvplot.bar(x='year', y='housing_units', xlabel= 'Year', ylabel= 'Housing Units', ylim=(365000,385000)
).opts(yformatter='%.0f', title = 'Housing units in San Francisco from 2010 to 2016')
Question: What is the overall trend in housing_units over the period being analysed?
Answer: From 2010 to 2016, the overall trend in housing units continued to increase
# Create a numerical aggregation by grouping the data by year and averaging the results
prices_square_foot_by_year = df_by_year[['sale_price_sqr_foot','gross_rent']]
# Review the resulting DataFrame
prices_square_foot_by_year
| sale_price_sqr_foot | gross_rent | |
|---|---|---|
| year | ||
| 2010 | 369.344353 | 1239.0 |
| 2011 | 341.903429 | 1530.0 |
| 2012 | 399.389968 | 2324.0 |
| 2013 | 483.600304 | 2971.0 |
| 2014 | 556.277273 | 3528.0 |
| 2015 | 632.540352 | 3739.0 |
| 2016 | 697.643709 | 4390.0 |
prices_square_foot_by_year.gross_rent.min()
1239.0
Question: What is the lowest gross rent reported for the years included in the DataFrame?
Answer: 1239.0
# Plot prices_square_foot_by_year.
# Inclued labels for the x- and y-axes, and a title.
prices_square_foot_by_year.hvplot.line(x='year', y= ['sale_price_sqr_foot','gross_rent'], xlabel= 'Year', ylabel = 'Gross Rent / Sale Price Per Square Foot'
).opts(title = 'Sale Price Per Square Foot and Average Gross Rent - 2010-2016 - San Francisco')
prices_square_foot_by_year DataFrame and interactive plots to answer the following questions:¶Question: Did any year experience a drop in the average sale price per square foot compared to the previous year?
Answer: In 2011, the average sale price per square foot decreased compare to the previous year.
Question: If so, did the gross rent increase or decrease during that year?
Answer: However, in this year (2011), the gross rent increased.
mean of the groups.¶# Group by year and neighbourhood and then create a new dataframe of the mean values
prices_by_year_by_neighbourhood = sfo_data_df.groupby(['year','neighbourhood']).mean()
# Review the DataFrame
prices_by_year_by_neighbourhood.head()
| sale_price_sqr_foot | housing_units | gross_rent | ||
|---|---|---|---|---|
| year | neighbourhood | |||
| 2010 | Alamo Square | 291.182945 | 372560.0 | 1239.0 |
| Anza Vista | 267.932583 | 372560.0 | 1239.0 | |
| Bayview | 170.098665 | 372560.0 | 1239.0 | |
| Buena Vista Park | 347.394919 | 372560.0 | 1239.0 | |
| Central Richmond | 319.027623 | 372560.0 | 1239.0 |
prices_by_year_by_neighbourhood.tail()
| sale_price_sqr_foot | housing_units | gross_rent | ||
|---|---|---|---|---|
| year | neighbourhood | |||
| 2016 | Telegraph Hill | 903.049771 | 384242.0 | 4390.0 |
| Twin Peaks | 970.085470 | 384242.0 | 4390.0 | |
| Van Ness/ Civic Center | 552.602567 | 384242.0 | 4390.0 | |
| Visitacion Valley | 328.319007 | 384242.0 | 4390.0 | |
| Westwood Park | 631.195426 | 384242.0 | 4390.0 |
sale_price_sqr_foot and gross_rent averages per year.¶# Filter out the housing_units
prices_by_year_by_neighbourhood = prices_by_year_by_neighbourhood.drop(['housing_units'], axis = 1)
# Review the first and last five rows of the DataFrame
prices_by_year_by_neighbourhood
| sale_price_sqr_foot | gross_rent | ||
|---|---|---|---|
| year | neighbourhood | ||
| 2010 | Alamo Square | 291.182945 | 1239.0 |
| Anza Vista | 267.932583 | 1239.0 | |
| Bayview | 170.098665 | 1239.0 | |
| Buena Vista Park | 347.394919 | 1239.0 | |
| Central Richmond | 319.027623 | 1239.0 | |
| ... | ... | ... | ... |
| 2016 | Telegraph Hill | 903.049771 | 4390.0 |
| Twin Peaks | 970.085470 | 4390.0 | |
| Van Ness/ Civic Center | 552.602567 | 4390.0 | |
| Visitacion Valley | 328.319007 | 4390.0 | |
| Westwood Park | 631.195426 | 4390.0 |
392 rows × 2 columns
sale_price_sqr_foot and gross_rent. Set the x-axis parameter to the year (x="year"). Use the groupby parameter to create an interactive widget for neighbourhood.¶# Use hvplot to create an interactive line plot of the average price per square foot
# The plot should have a dropdown selector for the neighbourhood
prices_by_year_by_neighbourhood.hvplot.line(x='year', y= ['sale_price_sqr_foot','gross_rent'], groupby= 'neighbourhood',
xlabel= 'Year', ylabel = 'Gross Rent / Sale Price Per Square Foot'
).opts(title = 'Sale Price Per Square Foot and Average Gross Rent - 2010-2016 - By Neighborhood')
Question: For the Anza Vista neighbourhood, is the average sale price per square foot for 2016 more or less than the price that’s listed for 2012?
Answer: For the Anza Vista neighbourhood, the average sale price per square foot for 2016 is less than the price that’s listed for 2012.
neighbourhoods_coordinates.csv file from the Resources folder into the notebook, and create a DataFrame named neighbourhood_locations_df. Be sure to set the index_col of the DataFrame as “Neighbourhood”.¶# Load neighbourhoods coordinates data
neighbourhood_locations_df = pd.read_csv(Path('./Resources/neighbourhoods_coordinates.csv'), index_col = 'Neighbourhood')
# Review the DataFrame
neighbourhood_locations_df
| Lat | Lon | |
|---|---|---|
| Neighbourhood | ||
| Alamo Square | 37.791012 | -122.402100 |
| Anza Vista | 37.779598 | -122.443451 |
| Bayview | 37.734670 | -122.401060 |
| Bayview Heights | 37.728740 | -122.410980 |
| Bernal Heights | 37.728630 | -122.443050 |
| ... | ... | ... |
| West Portal | 37.740260 | -122.463880 |
| Western Addition | 37.792980 | -122.435790 |
| Westwood Highlands | 37.734700 | -122.456854 |
| Westwood Park | 37.734150 | -122.457000 |
| Yerba Buena | 37.792980 | -122.396360 |
73 rows × 2 columns
sfo_data_df Dataframe, create a DataFrame named all_neighbourhood_info_df that groups the data by neighbourhood. Aggregate the results by the mean of the group.¶# Calculate the mean values for each neighbourhood
all_neighbourhood_info_df = sfo_data_df.groupby('neighbourhood').mean()
# Review the resulting DataFrame
all_neighbourhood_info_df.drop('year', axis = 1, inplace = True)
all_neighbourhood_info_df
| sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|
| neighbourhood | |||
| Alamo Square | 366.020712 | 378401.00 | 2817.285714 |
| Anza Vista | 373.382198 | 379050.00 | 3031.833333 |
| Bayview | 204.588623 | 376454.00 | 2318.400000 |
| Bayview Heights | 590.792839 | 382295.00 | 3739.000000 |
| Bernal Heights | 576.746488 | 379374.50 | 3080.333333 |
| ... | ... | ... | ... |
| West Portal | 498.488485 | 376940.75 | 2515.500000 |
| Western Addition | 307.562201 | 377427.50 | 2555.166667 |
| Westwood Highlands | 533.703935 | 376454.00 | 2250.500000 |
| Westwood Park | 687.087575 | 382295.00 | 3959.000000 |
| Yerba Buena | 576.709848 | 377427.50 | 2555.166667 |
73 rows × 3 columns
neighbourhood_locations_df DataFrame with the all_neighbourhood_info_df DataFrame.¶Note that the first cell uses the Pandas concat function to create a DataFrame named all_neighbourhoods_df.
The second cell cleans the data and sets the “Neighbourhood” column.
Be sure to run these cells to create the all_neighbourhoods_df DataFrame, which you’ll need to create the geospatial visualisation.
# Using the Pandas `concat` function, join the
# neighbourhood_locations_df and the all_neighbourhood_info_df DataFrame
# The axis of the concatenation is "columns".
# The concat function will automatially combine columns with
# identical information, while keeping the additional columns.
all_neighbourhoods_df = pd.concat(
[neighbourhood_locations_df, all_neighbourhood_info_df],
axis="columns",
sort=False
)
# Review the resulting DataFrame
display(all_neighbourhoods_df.head())
display(all_neighbourhoods_df.tail())
| Lat | Lon | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|
| Alamo Square | 37.791012 | -122.402100 | 366.020712 | 378401.0 | 2817.285714 |
| Anza Vista | 37.779598 | -122.443451 | 373.382198 | 379050.0 | 3031.833333 |
| Bayview | 37.734670 | -122.401060 | 204.588623 | 376454.0 | 2318.400000 |
| Bayview Heights | 37.728740 | -122.410980 | 590.792839 | 382295.0 | 3739.000000 |
| Bernal Heights | 37.728630 | -122.443050 | NaN | NaN | NaN |
| Lat | Lon | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|
| Yerba Buena | 37.79298 | -122.39636 | 576.709848 | 377427.5 | 2555.166667 |
| Bernal Heights | NaN | NaN | 576.746488 | 379374.5 | 3080.333333 |
| Downtown | NaN | NaN | 391.434378 | 378401.0 | 2817.285714 |
| Ingleside | NaN | NaN | 367.895144 | 377427.5 | 2509.000000 |
| Outer Richmond | NaN | NaN | 473.900773 | 378401.0 | 2817.285714 |
# Call the dropna function to remove any neighbourhoods that do not have data
all_neighbourhoods_df = all_neighbourhoods_df.reset_index().dropna()
# Rename the "index" column as "Neighbourhood" for use in the Visualisation
all_neighbourhoods_df = all_neighbourhoods_df.rename(columns={"index": "Neighbourhood"})
# Review the resulting DataFrame
display(all_neighbourhoods_df.head())
display(all_neighbourhoods_df.tail())
| Neighbourhood | Lat | Lon | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|---|
| 0 | Alamo Square | 37.791012 | -122.402100 | 366.020712 | 378401.0 | 2817.285714 |
| 1 | Anza Vista | 37.779598 | -122.443451 | 373.382198 | 379050.0 | 3031.833333 |
| 2 | Bayview | 37.734670 | -122.401060 | 204.588623 | 376454.0 | 2318.400000 |
| 3 | Bayview Heights | 37.728740 | -122.410980 | 590.792839 | 382295.0 | 3739.000000 |
| 5 | Buena Vista Park | 37.768160 | -122.439330 | 452.680591 | 378076.5 | 2698.833333 |
| Neighbourhood | Lat | Lon | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|---|
| 68 | West Portal | 37.74026 | -122.463880 | 498.488485 | 376940.75 | 2515.500000 |
| 69 | Western Addition | 37.79298 | -122.435790 | 307.562201 | 377427.50 | 2555.166667 |
| 70 | Westwood Highlands | 37.73470 | -122.456854 | 533.703935 | 376454.00 | 2250.500000 |
| 71 | Westwood Park | 37.73415 | -122.457000 | 687.087575 | 382295.00 | 3959.000000 |
| 72 | Yerba Buena | 37.79298 | -122.396360 | 576.709848 | 377427.50 | 2555.166667 |
points plot for the all_neighbourhoods_df DataFrame. Be sure to do the following:¶geo parameter to True.size parameter to “sale_price_sqr_foot”.color parameter to “gross_rent”.frame_width parameter to 700.frame_height parameter to 500.# Create a plot to analyse neighbourhood info
all_neighbourhoods_df.hvplot.points(
'Lon',
'Lat',
geo = True,
size = 'sale_price_sqr_foot',
color = 'gross_rent',
frame_width = 700,
frame_height = 500,
tiles = True,
#alpha = 0.3,
hover_cols = ['Neighbourhood']
)
/Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(multi_line_string) > 1: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. for line in multi_line_string: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(p_mline) > 0: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(multi_line_string) > 1: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. for line in multi_line_string: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(p_mline) > 0: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:245: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(multi_line_string) > 1: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:297: ShapelyDeprecationWarning: Iteration over multi-part geometries is deprecated and will be removed in Shapely 2.0. Use the `geoms` property to access the constituent parts of a multi-part geometry. for line in multi_line_string: /Users/wanlin/opt/anaconda3/envs/alpacaenv/lib/python3.7/site-packages/cartopy/crs.py:364: ShapelyDeprecationWarning: __len__ for multi-part geometries is deprecated and will be removed in Shapely 2.0. Check the length of the `geoms` property instead to get the number of parts of a multi-part geometry. if len(p_mline) > 0:
Question: Which neighbourhood has the highest gross rent, and which has the highest sale price per square foot?
Answer: The neighbourhood Westwood Park has the highest gross rent. The neighbourhood Union Square District has the highest sale price per square foot.
Based on the visualisations that you created, answer the following questions:
Question: How does the trend in rental income growth compare to the trend in sales prices? Does this same trend hold true for all the neighbourhoods across San Francisco?
Answer: Except for 2011, the sales prices showed a small increase from 2010 to 2016, while rental income growth has been showing strong growth. Although the growth rate of rental income growth slowed in 2011 and 2015, the rental income growth still increased from the previous year. The rental income growth maintained the same trend for all the neighbourhood across San Franciso. However, the sale price showed various state of volatility.
Question: What insights can you share with your company about the potential one-click, buy-and-rent strategy that they're pursuing? Do neighbourhoods exist that you would suggest for investment, and why?
Answer: With consistently rising rents and relatively stable housing prices, buying is a good strategy. However, since housing price fluctuations vary from region to region, choosing a quality area to buy becomes the key of the strategy.
Silver Terrace will be a good choice for buying, since it has relatively lower sale_price_sqr_foot value (much smaller point size in GeoViews) and higher gross_rent (much deeper point color in GeoViews). It is worth noting that among the dataset obtained, Silver Terrace only contains data for 2014, and more data needs to be collected to further verify the feasibility of the strategy.
sfo_data_df.head()
| year | neighbourhood | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|
| 0 | 2010 | Alamo Square | 291.182945 | 372560 | 1239 |
| 1 | 2010 | Anza Vista | 267.932583 | 372560 | 1239 |
| 2 | 2010 | Bayview | 170.098665 | 372560 | 1239 |
| 3 | 2010 | Buena Vista Park | 347.394919 | 372560 | 1239 |
| 4 | 2010 | Central Richmond | 319.027623 | 372560 | 1239 |
sfo_data_df[sfo_data_df.neighbourhood == 'Silver Terrace']
| year | neighbourhood | sale_price_sqr_foot | housing_units | gross_rent | |
|---|---|---|---|---|---|
| 280 | 2014 | Silver Terrace | 170.292549 | 380348 | 3528 |